There are many options for data visualization. They generally trade off ease of use against the configurability needed to produce polished graphs for publication. In my daily workflow, I use different options for different purposes. During analysis, I prefer quick-and-dirty approaches that let me see the data, but I prefer approaches that give me finer control when I create figures for a paper. This also sets some expectations: while a simple visualisation during analysis should take only one or two lines of code, finished figures usually require quite a bit of code to look just right. For this tutorial, I will focus on making publication-ready figures.
I have tried different languages and packages for creating figures. In my experience, the native R plot and the commonly used ggplot package do not provide sufficient control. I never found a way to define the exact dimensions and resolution of a figure, which is often necessary for creating figures according to journal specifications. Instead, I use the matplotlib and seaborn packages for Python. Their syntax is relatively easy and they provide full control over the aesthetics.
If you have never used Python before, it might seem a bit daunting. But modern Python distributions make it really easy to get started.
Download Anaconda: Anaconda is a Python distribution that comes with installers for all major operating systems. You can download it from here: https://www.anaconda.com/distribution/
Install additional packages: There are many additional packages that extend the functionality of Python. Anaconda comes with a set of the most commonly used packages for science, but I recommend one more for plotting, namely seaborn. To install it, open a terminal window and type "conda install seaborn".
In this tutorial, we will go through some steps to visualize clustering solutions. I'll use the iris dataset because it is commonly used and relatively easy to work with, but the same methods should apply to other datasets, e.g. behavioural or cognitive data.
Show the plots within the notebook:
%pylab inline
We load the iris data using the convenience function in scikit-learn.
import pandas as pd
from sklearn.datasets import load_iris
data = load_iris()
features = pd.DataFrame(data.data)
cluster_labels = data.target
First, we may want to show the feature profile of each cluster
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from scipy.stats import sem, zscore
import seaborn as sns
A few definitions help to define the plot and make it publication-ready
sns.set_style('white')
from matplotlib import rcParams
rcParams['font.family'] = 'serif'
rcParams['font.serif'] = ['CMU Serif'] # define the font for the plots
rcParams['text.usetex'] = True # rendering text with LaTeX gives better typesetting
rcParams['axes.labelsize'] = 9 # define the font size
rcParams['xtick.labelsize'] = 9
rcParams['ytick.labelsize'] = 9
rcParams['legend.fontsize'] = 9
mm2inches = 1 / 25.4 # one inch is exactly 25.4 mm
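Journals usually specify figure sizes in millimetres, while matplotlib expects inches. The following is a minimal sketch of the conversion and of the pixel dimensions that result at a given resolution. The helper names `mm_to_inches` and `figure_pixels` are hypothetical and not part of the original code.

```python
MM_PER_INCH = 25.4  # one inch is defined as exactly 25.4 mm


def mm_to_inches(mm):
    """Convert a length in millimetres to inches."""
    return mm / MM_PER_INCH


def figure_pixels(width_mm, height_mm, dpi):
    """Return the approximate pixel dimensions of a figure at the given resolution."""
    return (round(mm_to_inches(width_mm) * dpi),
            round(mm_to_inches(height_mm) * dpi))


# A 75 mm x 100 mm figure, the size used for the profile plot in this tutorial
figsize = (mm_to_inches(75), mm_to_inches(100))
print(figsize)
print(figure_pixels(75, 100, dpi=300))
```

The tuple `figsize` can be passed directly to `plt.figure(figsize=...)`.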
Shaping the data for plotting
features = features.apply(zscore)
features.head()
means = features.groupby(cluster_labels).mean() # mean z-score per cluster and feature
SEs = features.groupby(cluster_labels).sem() # standard error of the mean per cluster and feature
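For reference, the standard error of the mean is the sample standard deviation (with `ddof=1`, the default convention of `scipy.stats.sem`) divided by the square root of the group size. The sketch below spells this out on made-up numbers. The function name `group_mean_se` and the toy arrays are illustrative assumptions, not part of the iris analysis.

```python
import numpy as np


def group_mean_se(values, labels):
    """Return {cluster: (mean, standard error)} for a 1-D array of values."""
    result = {}
    for cluster in np.unique(labels):
        group = values[labels == cluster]
        mean = group.mean()
        # Standard error: sample standard deviation divided by sqrt(n)
        se = group.std(ddof=1) / np.sqrt(len(group))
        result[int(cluster)] = (float(mean), float(se))
    return result


# Toy data: six observations of one feature, assigned to two clusters
values = np.array([1.0, 2.0, 3.0, 10.0, 12.0, 14.0])
labels = np.array([0, 0, 0, 1, 1, 1])
stats = group_mean_se(values, labels)
print(stats)
```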
plt.figure(figsize=[75*mm2inches, 100*mm2inches], dpi=300)
colours = ["#3B4992FF", "#EE0000FF", "#008B45FF"]
for i, cluster in enumerate(np.unique(cluster_labels)):
    plt.errorbar(features.columns, means.iloc[cluster, :].values, yerr=SEs.iloc[cluster],
                 linewidth=2, capsize=2, elinewidth=1, markeredgewidth=1, color=colours[i])
plt.ylim([-2, 2])
plt.ylabel('z-score')
sns.despine(offset=8)
# Adding the labels
ax = plt.gca()
ax.set_xticks(np.arange(0, len(features.columns)))
ax.set_xticklabels(data['feature_names'], rotation=90)
ax.set_yticks(np.linspace(-2, 2, 5))
# Add legend
legend = plt.legend(np.unique(cluster_labels), bbox_to_anchor=(1.05, 1.05), frameon=True)
legend.get_frame().set_edgecolor('k')
legend.get_frame().set_linewidth(0.5)
# Adding lines
plt.axhline(0, linestyle='solid', linewidth=0.5, color='k')
plt.axhline(1, linestyle='dashed', linewidth=0.5, color='k')
plt.axhline(-1, linestyle='dashed', linewidth=0.5, color='k')
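To use a figure in a manuscript, it has to be written to a file. The following is a minimal, self-contained sketch of how this might look with `savefig`. The plotted numbers are made up, and the output path points to a temporary directory as a stand-in for your own location.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend for scripts; not needed inside a notebook
import matplotlib.pyplot as plt

MM_PER_INCH = 25.4
fig, ax = plt.subplots(figsize=(75 / MM_PER_INCH, 100 / MM_PER_INCH))
ax.plot([0, 1, 2, 3], [0.5, -0.2, 1.0, -1.0])  # made-up values
ax.set_ylabel("z-score")

# Hypothetical output location; replace with your own path
out_path = os.path.join(tempfile.mkdtemp(), "cluster_profiles.png")
fig.savefig(out_path, dpi=300)  # dpi sets the resolution of the raster output
plt.close(fig)
print(out_path)
```

In a notebook with inline plots, the figure is normally closed at the end of the cell, so the call to `savefig` should sit in the same cell as the plotting code.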
Note on colour aesthetics: Creating effective colour maps is not an easy task. You want colours that work well together, create contrasts, and are accessible to colour-blind individuals. For most plots, I use colour maps from this project: https://nanx.me/ggsci/. For plots that require continuous colour scales, I use maps from the viridis project that are integrated in matplotlib. These are designed to be perceptually continuous (read more here: https://www.r-bloggers.com/ggplot2-welcome-viridis/)
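The colour codes used in the plot above are eight-digit hex strings, in which the last two digits encode the alpha (opacity) channel. The sketch below shows how such a code maps onto RGBA values between 0 and 1. The helper `hex_to_rgba` is hypothetical and written only for illustration; matplotlib provides `matplotlib.colors.to_rgba` for the same purpose.

```python
def hex_to_rgba(code):
    """Convert '#RRGGBB' or '#RRGGBBAA' to a tuple of floats in [0, 1]."""
    digits = code.lstrip("#")
    if len(digits) == 6:
        digits += "FF"  # no alpha given: assume fully opaque
    if len(digits) != 8:
        raise ValueError(f"expected 6 or 8 hex digits, got {code!r}")
    return tuple(int(digits[i:i + 2], 16) / 255 for i in range(0, 8, 2))


colours = ["#3B4992FF", "#EE0000FF", "#008B45FF"]
for code in colours:
    print(code, hex_to_rgba(code))
```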
You can play around with the code by commenting out some of the commands and observing the effect. You can either comment out a single line with a hash (#) or disable multiple lines by wrapping them in triple quotes ("""), which turns them into a string that is not executed.
plt.figure(figsize=[75*mm2inches, 100*mm2inches], dpi=300)
colours = ["#3B4992FF", "#EE0000FF", "#008B45FF"]
for i, cluster in enumerate(np.unique(cluster_labels)):
    plt.errorbar(features.columns, means.iloc[cluster, :].values, yerr=SEs.iloc[cluster],
                 linewidth=2, capsize=2, elinewidth=1, markeredgewidth=1, color=colours[i])
plt.ylim([-2, 2])
#plt.ylabel('z-score')
sns.despine(offset=8)
# Adding the labels
ax = plt.gca()
ax.set_xticks(np.arange(0, len(features.columns)))
ax.set_xticklabels(data['feature_names'], rotation=90)
ax.set_yticks(np.linspace(-2, 2, 5))
"""
# Add legend
legend = plt.legend(np.unique(cluster_labels), bbox_to_anchor=(1.05, 1.05), frameon=True)
legend.get_frame().set_edgecolor('k')
legend.get_frame().set_linewidth(0.5)
"""
# Adding lines
plt.axhline(0, linestyle='solid', linewidth=0.5, color='k')
plt.axhline(1, linestyle='dashed', linewidth=0.5, color='k')
plt.axhline(-1, linestyle='dashed', linewidth=0.5, color='k')
from sklearn.metrics.pairwise import euclidean_distances
matrix = euclidean_distances(features.values)
Sorting the distance matrix according to the grouping
sorting_array = sorted(range(len(cluster_labels)), key=lambda k: cluster_labels[k])
sorted_array = matrix
sorted_array = sorted_array[sorting_array, :]
sorted_array = sorted_array[:, sorting_array]
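The same reordering can be written more compactly with NumPy. The following is a sketch on a made-up toy matrix, assuming a stable sort so that items keep their original order within each cluster.

```python
import numpy as np

# Toy symmetric distance matrix for four items and their cluster labels
toy_matrix = np.array([[0.0, 3.0, 1.0, 4.0],
                       [3.0, 0.0, 5.0, 2.0],
                       [1.0, 5.0, 0.0, 6.0],
                       [4.0, 2.0, 6.0, 0.0]])
toy_labels = np.array([0, 1, 0, 1])

# Indices that sort the labels; 'stable' keeps the original order within a cluster
order = np.argsort(toy_labels, kind="stable")

# np.ix_ reorders rows and columns in one step
toy_sorted = toy_matrix[np.ix_(order, order)]
print(order)
print(toy_sorted)
```

After sorting, items from the same cluster sit next to each other, so within-cluster distances form square blocks along the diagonal.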
import matplotlib.patches as patches
plt.figure(figsize=[100*mm2inches, 75*mm2inches], dpi=300)
sns.heatmap(sorted_array, vmin=0, cbar=True, cmap='viridis', square=True,
cbar_kws={'label': 'Euclidean distance'})
ax = plt.gca()
ax.set_xticks([]);
ax.set_yticks([]);
# Create a Rectangle patch
sorted_labels = np.asarray(sorted(cluster_labels))
for cluster in np.unique(cluster_labels):
    # Positions of this cluster's members in the sorted matrix
    positions = np.where(sorted_labels == cluster)[0]
    x = y = positions[0]
    width = height = positions[-1] + 1 - positions[0]
    rect = patches.Rectangle((x, y), width, height,
                             linewidth=0.5, edgecolor='w', facecolor='none')
    ax.add_patch(rect)
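The position and size of each square can also be read directly from the sorted labels. The sketch below uses made-up labels and asks `np.unique` for the first index and the count of each cluster.

```python
import numpy as np

labels = np.array([2, 0, 1, 0, 2, 2, 1, 0])  # made-up cluster labels
sorted_labels = np.sort(labels)

# For each cluster: index of its first entry in the sorted labels, and its size
clusters, starts, sizes = np.unique(sorted_labels, return_index=True, return_counts=True)

for cluster, start, size in zip(clusters, starts, sizes):
    # (start, start) is the corner of the square; size is its width and height
    print(cluster, (start, start), size)
```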
This is only a very basic demonstration of the plots that you can generate with matplotlib and seaborn. The best way to get started is to use the examples on the respective websites and adjust the code to your purposes.
Dedicated Python courses:
Books:
There are some native packages for network visualization in Python that do a pretty decent job for simple plots, e.g.:
import bct
import networkx as nx
colours = ["#3B4992FF", "#EE0000FF", "#008B45FF"]
matrix = matrix/matrix.max()
matrix = 1-matrix
plotting_matrix = bct.threshold_proportional(matrix, 0.1)
G = nx.from_numpy_array(plotting_matrix) # build a weighted graph from the thresholded matrix
pos = nx.kamada_kawai_layout(G)
for community in np.unique(cluster_labels):
    nx.draw_networkx_nodes(G, pos,
                           nodelist=np.where(cluster_labels == community)[0].tolist(),
                           node_color=colours[int(community)], # labels start at 0, so index directly
                           node_size=40,
                           alpha=0.8)
nx.draw_networkx_edges(G,pos,width=0.1,alpha=0.5)
plt.axis('off');
plt.show();
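Proportional thresholding keeps only the strongest fraction of connections. The sketch below illustrates the idea in plain NumPy for a symmetric matrix. It is a simplified stand-in written for this tutorial, not the implementation behind `bct.threshold_proportional`, and the name `keep_strongest` and the toy matrix are hypothetical.

```python
import numpy as np


def keep_strongest(weights, proportion):
    """Keep the strongest `proportion` of edges of a symmetric weight matrix."""
    w = np.array(weights, dtype=float)
    np.fill_diagonal(w, 0)                 # ignore self-connections
    upper = np.triu_indices_from(w, k=1)   # count each undirected edge once
    edge_weights = w[upper]
    n_keep = int(round(proportion * edge_weights.size))
    thresholded = np.zeros_like(w)
    if n_keep > 0:
        strongest = np.argsort(edge_weights)[::-1][:n_keep]
        rows, cols = upper[0][strongest], upper[1][strongest]
        thresholded[rows, cols] = w[rows, cols]
        thresholded[cols, rows] = w[rows, cols]  # keep the matrix symmetric
    return thresholded


# Made-up similarity matrix for four nodes
similarity = np.array([[1.0, 0.9, 0.2, 0.1],
                       [0.9, 1.0, 0.8, 0.3],
                       [0.2, 0.8, 1.0, 0.7],
                       [0.1, 0.3, 0.7, 1.0]])
result = keep_strongest(similarity, 0.5)
print(result)
```

With a proportion of 0.5, three of the six possible edges survive, and all other entries are set to zero.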
However, there are much fancier ways of plotting graphs available from a dedicated application called Gephi. You can download it for free here: https://gephi.org
Gephi expects the data in a particular format. First, we export the matrix (rescaled above so that larger values indicate greater similarity) as a CSV file:
pd.DataFrame(matrix).head()
pd.DataFrame(matrix).to_csv('/Users/joebathelt1/Desktop/test_gephi.csv')
Notice that the rows and columns have the same name. That is required for Gephi.
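Before switching to Gephi, it can help to read an exported matrix back in and confirm that the row and column labels match. The sketch below uses a small made-up matrix and a temporary file in place of the paths used in this tutorial.

```python
import os
import tempfile

import numpy as np
import pandas as pd

toy = pd.DataFrame(np.array([[1.0, 0.4, 0.1],
                             [0.4, 1.0, 0.6],
                             [0.1, 0.6, 1.0]]))

# Temporary location used only for this illustration
path = os.path.join(tempfile.mkdtemp(), "toy_gephi.csv")
toy.to_csv(path)

# The first column of the file holds the row labels
reloaded = pd.read_csv(path, index_col=0)
row_labels = [str(label) for label in reloaded.index]
column_labels = [str(label) for label in reloaded.columns]
print(row_labels == column_labels)
```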
Next, we produce a second file with information about the cluster membership:
pd.DataFrame(np.vstack([pd.DataFrame(matrix).index, cluster_labels]).transpose(), columns=['ID', 'group']).head()
pd.DataFrame(np.vstack([pd.DataFrame(matrix).index, cluster_labels]).transpose(), columns=['ID', 'group']).to_csv('/Users/joebathelt1/Desktop/test_gephi_groups.csv', index=False)
This table contains one column with the ID (same as in the distance matrix) and a column with the associated cluster of that node.
Loading the data in Gephi:
from IPython.display import Image
Import the distance matrix
Image('./Gephi_steps/Step1.png')
Image('./Gephi_steps/Step2.png')
Select "Matrix" as the input type. Check that the table in the preview looks as expected.
Image('./Gephi_steps/Step3.png')
Select "Undirected" as the type, which is the most common choice in community detection.
Image('./Gephi_steps/Step4.png')
You should now see the network in a random layout. You can play with this visualization. This matrix has not been thresholded yet and probably contains too many edges. You can threshold it interactively here and then apply an algorithmic layout.
Image('./Gephi_steps/Step5.png')
Select a threshold for the edge weight by dragging Edge Weight to the Filter Queries. Then use the slider to adjust the threshold. Choose a threshold at which the graph is still fully connected but not too dense.
Image('./Gephi_steps/Step6.png')
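If you would like a starting value for the slider, one option is to look for the largest edge-weight cut-off at which every node can still be reached from every other node. The sketch below is an assumption about how one might estimate this in Python; it is not a Gephi feature. The helpers `is_connected` and `highest_connected_threshold` are hypothetical, and the matrix is made up.

```python
import numpy as np


def is_connected(weights, threshold):
    """Check whether all nodes are reachable using edges with weight >= threshold."""
    n = weights.shape[0]
    adjacency = weights >= threshold
    seen = {0}
    stack = [0]
    while stack:  # depth-first search starting from node 0
        node = stack.pop()
        for neighbour in np.flatnonzero(adjacency[node]).tolist():
            if neighbour not in seen:
                seen.add(neighbour)
                stack.append(neighbour)
    return len(seen) == n


def highest_connected_threshold(weights):
    """Return the largest positive edge weight that keeps the graph connected."""
    upper = weights[np.triu_indices_from(weights, k=1)]
    candidates = sorted({w for w in upper.tolist() if w > 0}, reverse=True)
    for threshold in candidates:
        if is_connected(weights, threshold):
            return threshold
    return None  # the graph is not connected even with all edges


# Made-up similarity matrix for four nodes
similarity = np.array([[1.0, 0.9, 0.2, 0.1],
                       [0.9, 1.0, 0.8, 0.3],
                       [0.2, 0.8, 1.0, 0.7],
                       [0.1, 0.3, 0.7, 1.0]])
best = highest_connected_threshold(similarity)
print(best)
```

In this toy example the nodes form a chain, so the weakest link of the chain sets the cut-off.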
Next, we import the group labels. Select the Data Laboratory tab.
Image('./Gephi_steps/Step7.png')
Then, click Import Spreadsheet and select the group table.
Image('./Gephi_steps/Step8.png')
Select "Nodes table" as the input type.
Image('./Gephi_steps/Step9.png')
In the last dialog window, choose "Append to existing workspace". You should now see a new column called "group".
Image('./Gephi_steps/Step10.png')
To see the group assignment in the graph visualization, click the palette symbol, select "Partition", and then choose "group".
Image('./Gephi_steps/Step11.png')
You can adjust the colours to match the other figures. Use the hex code for exact colour matching.
Image('./Gephi_steps/Step12.png')
Once you're happy with the figure aesthetics, select the "Preview" tab to adjust the final export of the figure.
Image('./Gephi_steps/Step13.png')